Kernel Based Progressive Distillation for Adder Neural Networks
Adder Neural Networks (ANNs), which contain only additions, offer a new way of developing deep neural networks with low energy consumption. Unfortunately, there is an accuracy drop when all convolution filters are replaced by adder filters. The main reason is the optimization difficulty of ANNs using the $\ell_1$-norm, in which the gradient estimated during back-propagation is inaccurate. In this paper, we present a novel method for further improving the performance of ANNs without increasing the number of trainable parameters, via a progressive kernel based knowledge distillation (PKKD) method. A convolutional neural network (CNN) with the same architecture is simultaneously initialized and trained as a teacher network, and the features and weights of the ANN and CNN are transformed into a new space to eliminate the accuracy drop. The similarity is computed in a higher-dimensional space to disentangle the difference between their distributions using a kernel based method. Finally, the desired ANN is learned progressively, based on information from both the ground truth and the teacher. The effectiveness of the proposed method for learning ANNs with higher performance is then verified on several benchmarks. For instance, the ANN-50 trained using the proposed PKKD method obtains 76.8\% top-1 accuracy on the ImageNet dataset, which is 0.6\% higher than that of ResNet-50.
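The core difference between an adder filter and a convolution filter can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the authors' implementation (the function names are our own): for a single input patch, the adder filter returns the negative $\ell_1$ distance between the patch and the filter weights, while the convolution filter returns their cross-correlation.

```python
import numpy as np

def adder2d_single(patch, filt):
    """Adder 'filter' response for one patch: the negative
    l1 distance between the patch and the filter weights."""
    return -np.abs(patch - filt).sum()

def conv2d_single(patch, filt):
    """Conventional convolution response for the same patch:
    the cross-correlation (element-wise product, summed)."""
    return (patch * filt).sum()

patch = np.array([[1.0, 2.0], [3.0, 4.0]])
filt = np.array([[1.0, 1.0], [1.0, 1.0]])
print(adder2d_single(patch, filt))  # -(0 + 1 + 2 + 3) = -6.0
print(conv2d_single(patch, filt))   # 1 + 2 + 3 + 4 = 10.0
```

Because the adder response is a negative distance, it is always non-positive, so ANN features follow a noticeably different distribution from CNN features, which hints at the mismatch the kernel transformation in PKKD is meant to disentangle.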
An Empirical Study of Adder Neural Networks for Object Detection
Adder neural networks (AdderNets) have shown impressive performance on image classification using only addition operations, which are more energy efficient than the multiplications used in traditional convolutional neural networks. Beyond classification, there is strong demand for reducing the energy consumption of modern object detectors via AdderNets in real-world applications such as autonomous driving and face detection. In this paper, we present an empirical study of AdderNets for object detection. We first reveal that the batch normalization statistics in the pre-trained adder backbone should not be frozen, due to the relatively large feature variance of AdderNets. Moreover, we insert more shortcut connections in the neck part and design a new feature fusion architecture to avoid the sparse features of adder layers. We present extensive ablation studies exploring several design choices of adder detectors. Comparisons with state-of-the-art methods are conducted on the COCO and PASCAL VOC benchmarks. Specifically, the proposed Adder FCOS achieves 37.8% AP on the COCO val set, demonstrating performance comparable to that of its convolutional counterpart with about a $1.4\times$ energy reduction.
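The batch-normalization point can be made concrete with a toy sketch (hypothetical NumPy code, not the detectors' actual BN layer): when the running statistics are frozen at values estimated during pre-training, a high-variance adder feature map is mis-normalized, whereas an unfrozen layer re-estimates statistics from the current batch and keeps updating its running averages.

```python
import numpy as np

class BatchNorm1d:
    """Toy batch norm contrasting frozen vs. unfrozen running
    statistics (an illustrative sketch only)."""
    def __init__(self, momentum=0.1):
        self.running_mean, self.running_var = 0.0, 1.0
        self.momentum, self.frozen = momentum, False

    def __call__(self, x):
        if self.frozen:  # use the stale pre-trained statistics
            mean, var = self.running_mean, self.running_var
        else:            # normalize with the current batch, update stats
            mean, var = x.mean(), x.var()
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        return (x - mean) / np.sqrt(var + 1e-5)

bn = BatchNorm1d()
feats = np.array([10.0, 20.0, 30.0])  # high-variance "adder" features
out = bn(feats)     # unfrozen: zero-mean output, running stats updated
bn.frozen = True
stale = bn(feats)   # frozen: stale statistics leave the output mis-centered
```

With frozen statistics the output retains a large offset, while the unfrozen layer re-centers each batch, which mirrors why unfreezing BN in the pre-trained adder backbone helps.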
Supplementary Material: Progressive Kernel Based Knowledge Distillation for Adder Neural Networks
Thus, the transformation in Eq.(7) in the main paper can be expressed as a linear combination of ...

In this section, more experimental results of PKKD are presented. We compare PKKD with AT [5] and other methods on ResNet-20 using the CIFAR-10 dataset, as shown in Tab. 1.

Table 1: Comparison with other methods on ResNet-20 using the CIFAR-10 dataset.
  PKKD             92.96%
  ANN + dropout    92.20%
  Snapshot-KD [3]  92.33%
  SP-KD [2]        92.38%
  Gift-KD [4]      92.22%
  AT [5]           92.27%

Then, we show the superiority of the proposed method on traditional CNN distillation. The results are shown in Tab. 2.

Table 2: PKKD and KD in CNN distillation. 'P/NP' stands for using a progressive or fixed teacher, respectively.
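The kernel-based comparison underlying PKKD can be illustrated with a generic sketch. The code below is a hypothetical NumPy example: the Gaussian kernel, the sigma parameter, and the squared-difference loss are our assumptions for illustration, not the exact PKKD formulation. Student and teacher features are compared through their kernel (self-similarity) matrices, i.e., in the kernel-induced higher-dimensional space rather than in the raw feature space.

```python
import numpy as np

def gaussian_kernel_matrix(a, b, sigma=1.0):
    """Pairwise Gaussian kernel k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))
    for two sets of feature vectors of shape (n, d) and (m, d)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_distill_loss(student_feats, teacher_feats, sigma=1.0):
    """Match the student's kernel self-similarity structure to the
    teacher's (a generic kernel-matching sketch, not the PKKD loss)."""
    ks = gaussian_kernel_matrix(student_feats, student_feats, sigma)
    kt = gaussian_kernel_matrix(teacher_feats, teacher_feats, sigma)
    return ((ks - kt) ** 2).mean()

rng = np.random.RandomState(0)
s = rng.randn(4, 8)          # student features (4 samples, 8 dims)
print(kernel_distill_loss(s, s))      # identical features -> 0.0
print(kernel_distill_loss(s, 2 * s) > 0)  # mismatched scales -> positive
```

Matching kernel matrices rather than raw activations makes the comparison insensitive to distribution differences that preserve pairwise structure, which is one motivation for comparing ANN and CNN features in a kernel-induced space.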
Supplementary Materials: An Empirical Study of Adder Neural Networks for Object Detection
Xinghao Chen
We also tried to utilize these tricks for training CNN-based object detectors. As shown in Table B, these tricks bring 0.2%-0.6% mAP improvement. In contrast, this strategy improves the adder detector by 1.2% mAP, which indicates that ... It is an interesting topic to explore the robustness of AdderNet-based detectors to domain shift.

Figure 1: Qualitative results of RetinaNet [2], FCOS [3] and our proposed Adder FCOS.

As shown in Table C, Adder FCOS suffers a 2.2% mAP drop on Cityscapes compared with its convolutional counterpart, which is similar to the performance drop on COCO.
Review for NeurIPS paper: Kernel Based Progressive Distillation for Adder Neural Networks
Weaknesses: The effectiveness of the kernel method, one of the claimed contributions, is not fully justified. As shown in Table 1, the kernel operation brings an insignificant gain on CIFAR-10 with a shallower network (ResNet-20). The gains (below 0.21%) seem insignificant and may be due to the stochastic initialization of the networks, suggesting that the proposed kernel scheme may not be as effective as advocated. I advise that a comparison on ImageNet with a deeper network (e.g., ResNet-50) be performed. The current experiments are not strong enough to support the claim that the proposed method is a competitive knowledge distillation method.
Review for NeurIPS paper: Kernel Based Progressive Distillation for Adder Neural Networks
I believe that by bridging the gap between adder neural networks and CNNs, this work provides a considerable contribution, allowing adder networks to be considered among practical architectures and encouraging the community to research them further. In agreement with the reviewers, I think the proposed method is thoroughly investigated empirically. Please make sure to update the paper with all the results and answers that you provided in your rebuttal.